19. Gini Impurity
So far, you've seen how to use entropy to calculate the information gain of a split. There is an alternative measure of the quality of a split: the Gini index.
If there are K classes, and \hat{p}_k is the fraction of observations from class k in a node, we can calculate G (the Gini index) for the node:

G = \sum_{k=1}^{K} \hat{p}_k (1 - \hat{p}_k)
The Gini index takes on a small value if all of the proportions are close to zero or one. You can think of it as a measure of node purity: if the value is small, the node mostly contains observations from a single class. It turns out that the Gini index and entropy are quite similar numerically.
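To make the purity interpretation concrete, here is a small sketch (the function names `gini` and `entropy` are my own) that evaluates both measures on a few class-proportion vectors; note how both are 0 for a pure node and largest for an even split:

```python
import numpy as np

def gini(proportions):
    """Gini index G = sum_k p_k * (1 - p_k) for class proportions summing to 1."""
    p = np.asarray(proportions, dtype=float)
    return float(np.sum(p * (1 - p)))

def entropy(proportions):
    """Entropy in bits, skipping zero proportions (0 * log 0 = 0 by convention)."""
    p = np.asarray(proportions, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# A pure node scores 0 under both measures; a 50/50 node is maximally impure.
for p in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(p, gini(p), entropy(p))
```

Running this shows the two curves tracking each other closely, which is why trees grown under either criterion tend to look similar.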
To measure the increase in purity of a split using the Gini index, calculate the Gini index on the parent node and subtract the weighted average of the Gini indexes of the child nodes, where each child is weighted by the fraction of the parent's observations it receives:

\Delta G = G_{\text{parent}} - \sum_{j} \frac{n_j}{n_{\text{parent}}} G_j
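This parent-minus-weighted-children calculation can be sketched directly on label arrays (the helper names `gini_from_labels` and `gini_gain` are illustrative, not from any library):

```python
import numpy as np

def gini_from_labels(y):
    """Gini index of a node given its array of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

def gini_gain(parent, left, right):
    """Parent Gini minus the size-weighted average of the children's Gini."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini_from_labels(left) \
                      + (len(right) / n) * gini_from_labels(right)
    return gini_from_labels(parent) - weighted_children

# A split that perfectly separates the two classes gains the full 0.5.
print(gini_gain([0, 0, 1, 1], left=[0, 0], right=[1, 1]))  # 0.5
```

A split that leaves each child as mixed as the parent (e.g. left=[0, 1], right=[0, 1]) would score a gain of 0, so the tree-growing algorithm would never choose it.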
Scikit-learn supports both the Gini impurity and information gain metrics for evaluating the quality of splits, via the `criterion` hyperparameter.
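For example, `DecisionTreeClassifier` accepts `criterion="gini"` (the default) or `criterion="entropy"`; the two usually produce very similar trees, as the numerical similarity above suggests:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Fit one tree per criterion and compare training accuracy.
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
    clf.fit(X, y)
    print(criterion, clf.score(X, y))
```

In practice the choice rarely changes the resulting model much; Gini is marginally cheaper to compute since it avoids logarithms.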